Query2Vec: NLP Meets Databases for Generalized Workload Analytics
Authors
Shrainik Jain, Bill Howe, Jiaqi Yan, Thierry Cruanes
Abstract
We consider methods for learning vector representations of SQL queries to support generalized workload analytics tasks, including workload summarization for index selection and prediction of queries that will trigger memory errors. We examine vector representations of both raw SQL text and optimized query plans, and evaluate these methods on synthetic and real SQL workloads. We find that general algorithms based on vector representations can outperform existing approaches that rely on specialized features. For index recommendation, we cluster the vector representations to compress large workloads with no loss in the performance obtained from the recommended indexes. For error prediction, we train a classifier over the learned vectors that can automatically relate subtle syntactic patterns to specific errors raised during query execution. Surprisingly, we also find that these methods enable transfer learning: a model trained on one SQL corpus can be applied to an unrelated corpus and still perform well. We find that these general approaches, when trained on a large corpus of SQL queries, provide a robust foundation for a variety of workload analysis tasks and database features, without requiring application-specific feature engineering.
PVLDB Reference Format: Shrainik Jain, Bill Howe, Jiaqi Yan, and Thierry Cruanes. Query2Vec: An Evaluation of NLP Techniques for Generalized Workload Analytics. PVLDB, 11(5): xxxx-yyyy, 2018. DOI: https://doi.org/TBD
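The workload summarization idea in the abstract (embed each query, cluster the vectors, keep one representative per cluster) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract describes learned representations of SQL text and query plans, whereas the hypothetical `embed` function here substitutes a hashed bag-of-tokens vector so the example stays self-contained. The names `embed`, `kmeans`, and `summarize`, the vector dimension, and the sample workload are all assumptions made for illustration.

```python
import math
import re
import zlib


def embed(sql, dim=64):
    """Assumed stand-in for a learned query embedding:
    hashed token counts, L2-normalized."""
    vec = [0.0] * dim
    for tok in re.findall(r"\w+|[^\w\s]", sql.lower()):
        # crc32 gives a hash that is stable across Python processes.
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(v, centroids):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))


def kmeans(vectors, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point seeding."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    for _ in range(iters):
        assign = [nearest(v, centroids) for v in vectors]
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(v, centroids) for v in vectors]


def summarize(workload, k):
    """Return up to k representative queries, one per non-empty cluster."""
    vectors = [embed(q) for q in workload]
    centroids, assign = kmeans(vectors, k)
    reps = []
    for j in range(k):
        members = [i for i, a in enumerate(assign) if a == j]
        if members:
            best = min(members, key=lambda i: sq_dist(vectors[i], centroids[j]))
            reps.append(workload[best])
    return reps


workload = [
    "SELECT name FROM users WHERE id = 1",
    "SELECT name FROM users WHERE id = 2",
    "SELECT name FROM users WHERE id = 3",
    "SELECT SUM(amount) FROM orders GROUP BY region",
    "SELECT SUM(amount) FROM orders GROUP BY customer",
]
for query in summarize(workload, 2):
    print(query)
```

In a setting like the one the abstract describes, the compressed workload returned by `summarize` would be handed to an index advisor in place of the full workload. Whether the recommended indexes stay as effective depends on the quality of the embedding, which this toy hashing scheme does not attempt to match.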
Similar resources
Query Clustering using Segment Specific Context Embeddings
This paper presents a novel query clustering approach to capture the broad interest areas of users querying search engines. We make use of recent advances in NLP word2vec and extend it to get query2vec, vector representations of queries, based on query contexts, obtained from the top search results for the query and use a highly scalable Divide & Merge clustering algorithm on top of the query v...
WiSeDB: A Learning-based Workload Management Advisor for Cloud Databases
Workload management for cloud databases deals with the tasks of resource provisioning, query placement, and query scheduling in a manner that meets the application’s performance goals while minimizing the cost of using cloud resources. Existing solutions have approached these three challenges in isolation while aiming to optimize a single performance metric. In this paper, we introduce WiSeDB, ...
Generalized Snapshot Isolation and a Prefix-Consistent Implementation
Generalized snapshot isolation extends snapshot isolation as used in Oracle and other databases in a manner suitable for replicated databases. While (conventional) snapshot isolation requires that transactions observe the “latest” snapshot of the database, generalized snapshot isolation allows the use of “older” snapshots, facilitating a replicated implementation. We show that many of the desir...
All Schedules Dynamic
This paper presents the Affected Set Priority Ceiling (ASPC) concurrency control protocol for real-time object-oriented databases. The protocol is based on a combination of a semantic locking technique and priority ceiling techniques. The paper specifies six criteria for real-time concurrency control: high concurrency, deadlock prevention, predictability, temporal consistency enforcement, logical...
In-Memory Data Analytics on Coupled CPU-GPU Architectures
In the big data era, in-memory data analytics is an effective means of achieving high performance data processing and realizing the value of data in a timely manner. Efforts in this direction have been spent on various aspects, including in-memory algorithmic designs and system optimizations. In this paper, we propose to develop the next-generation in-memory relational database processing techn...
Journal title:
- CoRR
Volume abs/1801.05613, issue -
Pages -
Publication date 2018